Student Performance


ChatGPT trounces humans in entrance exams for top Japan university, study finds

The Japan Times

AI models surpassed the highest score recorded by a human test taker on this year's University of Tokyo entrance exam, a new study shows. If an artificial intelligence model such as ChatGPT had taken the entrance exams for Japan's top university in 2026, it would have placed first and been admitted, scoring higher than any human test taker, a study by AI startup LifePrompt has found. The research used three major AI models -- ChatGPT 5.2 Thinking by OpenAI, Gemini 3 Pro Preview by Google and Claude Opus 4.5 by Anthropic -- and had them take the actual entrance exam administered by the University of Tokyo in February 2026 to assess candidates for courses starting in April. The university's category 3 science exam, often taken by applicants to its medical school, is considered the most difficult exam to pass in Japan.



Can Language Models Teach? Teacher Explanations Improve Student Performance via Personalization

Neural Information Processing Systems

A hallmark property of explainable AI models is the ability to teach other agents, communicating knowledge of how to perform a task. While Large Language Models (LLMs) perform complex reasoning by generating explanations for their predictions, it is unclear whether they also make good teachers for weaker agents. To address this, we consider a student-teacher framework between two LLM agents and study if, when, and how the teacher should intervene with natural language explanations to improve the student's performance. Since communication is expensive, we define a budget such that the teacher only communicates explanations for a fraction of the data, after which the student should perform well on its own. We decompose the teaching problem along four axes: (1) whether the teacher's test-time intervention improves student predictions, (2) when it is worth explaining a data point, (3) how the teacher should personalize explanations to better teach the student, and (4) whether teacher explanations also improve student performance on future unexplained data.
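The budgeted-intervention setup described above can be sketched as follows. This is a minimal illustration, not the paper's method: `teacher_explain` and `expected_utility` are hypothetical stand-ins for the teacher LLM and for whatever criterion the teacher uses to decide which points merit an explanation.

```python
def teacher_explain(x):
    # Hypothetical stand-in for the teacher LLM's explanation.
    return f"explanation for {x}"

def expected_utility(x):
    # Hypothetical score of how much an explanation would help the student.
    # Here: a toy heuristic based on input length.
    return len(str(x))

def budgeted_intervention(data, budget_fraction=0.3):
    """Teacher explains only the fraction of points with highest expected utility."""
    k = int(len(data) * budget_fraction)
    ranked = sorted(data, key=expected_utility, reverse=True)
    explained = set(ranked[:k])
    return [(x, teacher_explain(x) if x in explained else None) for x in data]

batch = ["q1", "longer question 2", "q3", "the longest question four"]
for x, expl in budgeted_intervention(batch, budget_fraction=0.5):
    print(x, "->", expl)
```

With a budget of 0.5, only the two highest-utility points receive explanations; the rest are left for the student to answer unaided, matching the paper's cost constraint.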


Automatic Essay Scoring and Feedback Generation in Basque Language Learning

Azurmendi, Ekhi, Arregi, Xabier, de Lacalle, Oier Lopez

arXiv.org Artificial Intelligence

This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion-specific scores covering correctness, richness, coherence, cohesion, and task alignment, enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for both scoring and explanation generation. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.


AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

Zbeeb, Mohammad, Hammoud, Hasan Abed Al Kader, Mukalled, Sina, Rizk, Nadine, Karnib, Fatima, Lakkis, Issam, Mohanna, Ammar, Ghanem, Bernard

arXiv.org Artificial Intelligence

The benchmark spans five core categories -- grammar, morphology, spelling, reading comprehension, and syntax -- through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.


Uncovering Students' Inquiry Patterns in GenAI-Supported Clinical Practice: An Integration of Epistemic Network Analysis and Sequential Pattern Mining

Wei, Jiameng, Dang, Dinh, Yang, Kaixun, Stokes, Emily, Mazeh, Amna, Lim, Angelina, Dai, David Wei, Moore, Joel, Fan, Yizhou, Gasevic, Danijela, Gasevic, Dragan, Chen, Guanliang

arXiv.org Artificial Intelligence

Assessment of medication history-taking has traditionally relied on human observation, limiting scalability and detailed performance data. While Generative AI (GenAI) platforms enable extensive data collection and learning analytics provides powerful methods for analyzing educational traces, these approaches remain largely underexplored in pharmacy clinical training. This study addresses this gap by applying learning analytics to understand how students develop clinical communication competencies with GenAI-powered virtual patients -- a crucial endeavor given the diversity of student cohorts, varying language backgrounds, and the limited opportunities for individualized feedback in traditional training settings. We analyzed 323 students' interaction logs across Australian and Malaysian institutions, comprising 50,871 coded utterances from 1,487 student-GenAI dialogues. Combining Epistemic Network Analysis to model inquiry co-occurrences with Sequential Pattern Mining to capture temporal sequences, we found that high performers demonstrated strategic deployment of information recognition behaviors. Specifically, high performers centered inquiry on recognizing clinically relevant information, integrating rapport-building and structural organization, while low performers remained in routine question-verification loops. Demographic factors, including first-language background, prior pharmacy work experience, and institutional context, also shaped distinct inquiry patterns. These findings reveal inquiry patterns that may indicate clinical reasoning development in GenAI-assisted contexts, providing methodological insights for health professions education assessment and informing adaptive GenAI system design that supports diverse learning pathways.


Generating Reading Comprehension Exercises with Large Language Models for Educational Applications

Huang, Xingyu, Jiang, Fei, Xiao, Jianli

arXiv.org Artificial Intelligence

With the rapid development of large language models (LLMs), the applications of LLMs have grown substantially. In the education domain, LLMs demonstrate significant potential, particularly in automatic text generation, which enables the creation of intelligent and adaptive learning content. This paper proposes a new LLM framework, named Reading Comprehension Exercise Generation (RCEG), which can automatically generate high-quality, personalized English reading comprehension exercises. First, RCEG uses fine-tuned LLMs to generate candidate content; then, it uses a discriminator to select the best candidate, greatly improving the quality of the generated content. To evaluate the performance of RCEG, a dedicated dataset for English reading comprehension is constructed for the experiments, and comprehensive evaluation metrics are used to analyze the results. These metrics include content diversity, factual accuracy, linguistic toxicity, and pedagogical alignment. Experimental results show that RCEG significantly improves the relevance and cognitive appropriateness of the generated exercises.
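The generate-then-discriminate pipeline the abstract describes can be sketched in a few lines. This is a hypothetical illustration of the control flow only: `generate_candidates` stands in for the fine-tuned LLM generator and `discriminator_score` for the trained discriminator, here replaced by a toy word-overlap heuristic.

```python
def generate_candidates(passage, n=3):
    # Hypothetical stand-in for the fine-tuned LLM generator.
    return [f"Q{i}: What does the passage say about topic {i}?" for i in range(n)]

def discriminator_score(passage, candidate):
    # Hypothetical stand-in for the discriminator; here a toy overlap heuristic.
    return sum(w in passage for w in candidate.lower().split())

def rceg_pipeline(passage, n=3):
    """Generate n candidate exercises, then keep the discriminator's top pick."""
    candidates = generate_candidates(passage, n)
    return max(candidates, key=lambda c: discriminator_score(passage, c))

print(rceg_pipeline("the passage says topic 1 matters"))
```

The key design choice is decoupling generation from selection: the generator can over-produce diverse candidates while the discriminator enforces quality, which is what drives the reported improvement.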



Densely Connected Attention Propagation for Reading Comprehension

Neural Information Processing Systems

We propose DecaProp (Densely Connected Attention Propagation), a new densely connected neural architecture for reading comprehension (RC). There are two distinct characteristics of our model. Firstly, our model densely connects all pairwise layers of the network, modeling relationships between passage and query across all hierarchical levels. Secondly, the dense connectors in our network are learned via attention instead of standard residual skip-connectors. To this end, we propose novel Bidirectional Attention Connectors (BAC) for efficiently forging connections throughout the network. We conduct extensive experiments on four challenging RC benchmarks. Our proposed approach achieves state-of-the-art results on all four, outperforming existing baselines by 2.6% to 14.2% in absolute F1 score.
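The core operation underlying attention-based connectors is computing attended features in both directions from a shared passage-query affinity matrix. The sketch below is a simplified, hypothetical illustration of that idea, not the paper's BAC module (which additionally compresses the attended outputs into low-dimensional connectors).

```python
import numpy as np

def softmax(x):
    # Row-wise softmax, numerically stabilized.
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def bidirectional_attention(P, Q):
    """Toy bidirectional attention between passage P (n x d) and query Q (m x d)."""
    d = P.shape[1]
    E = P @ Q.T / np.sqrt(d)   # shared affinity matrix, shape (n, m)
    P2Q = softmax(E) @ Q       # passage-to-query attended features, (n, d)
    Q2P = softmax(E.T) @ P     # query-to-passage attended features, (m, d)
    return P2Q, Q2P
```

Because both directions reuse one affinity matrix, such a connector can be inserted between any pair of layers cheaply, which is what makes dense pairwise connection across hierarchical levels feasible.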


Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays

Hua, Haowei, Jiao, Hong, Wang, Xinyi

arXiv.org Artificial Intelligence

The majority of summarized essays fall well below the 512-token limit (marked by the red dashed line), indicating that the summarization process effectively compressed the original texts while maintaining consistency in length. The smooth decline beyond 300 tokens and the sparse occurrence of samples approaching the upper limit suggest that very few summaries exceeded the intended compression threshold. Overall, this distribution demonstrates that the GPT-5-mini summarizer produced concise and length-stable representations, ensuring efficient model input handling and minimizing the risk of truncation in downstream processing.
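The truncation check described above amounts to measuring what share of summaries fit the scoring model's input window. A minimal sketch, using whitespace tokens as a rough proxy for model tokens (an assumption; the paper's 512-token limit refers to the scoring model's tokenizer):

```python
def fraction_within_limit(summaries, limit=512):
    """Share of summaries that fit within the scoring model's input window."""
    lengths = [len(s.split()) for s in summaries]  # whitespace proxy for tokens
    return sum(l <= limit for l in lengths) / len(lengths)

# Toy check: one short summary, one over-long one.
print(fraction_within_limit(["a short summary", "word " * 600]))
```

In practice one would count tokens with the scoring model's own tokenizer, since whitespace counts underestimate subword token counts.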